Investigation of Wine Quality by T.Y. Chen

Some notes:
  • I used a distinct color for each level of “quality” instead of a color gradient, because I find distinct colors easier to read when looking for relationships. I used different line types instead of a gradient for the same reason.

  • Thanks for the suggestion. However, I find it easier to discuss all the graphs of one kind together, at the end of that group, since some of the variables have quite similar properties.

  • My approach is to first draw every possible plot for the dataset, then pick out specific plots and variables to discuss further. I think this keeps the overview of all the variables together while staying reasonably reader-friendly.

Thank you!

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
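
A table like the one above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the code that produced this report. The function name `six_number_summary` is hypothetical, and the quartile convention (linear interpolation, Python's "inclusive" method) is an assumption. It is checked against the X column, which appears to simply number the rows from 1 to 1599.

```python
from statistics import mean, median, quantiles


def six_number_summary(values):
    """Return min, 1st quartile, median, mean, 3rd quartile and max.

    Assumption: quartiles use linear interpolation between data points
    (method="inclusive"), which reproduces the X column of the table.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median(data),
        "mean": mean(data),
        "q3": q3,
        "max": data[-1],
    }


# The X column is assumed to be the row index 1..1599.
x_summary = six_number_summary(range(1, 1600))
print(x_summary)
```

For the row index this gives a first quartile of 400.5 and a third quartile of 1199.5, matching the X column of the table.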

Univariate Analysis

graph summary (fixed.acidity)

  • Shape: approx. normal, skewed right; however, there are few observations around 10
  • Center: around 6-8
  • Spread: pretty wide; approx 6-16
  • Outliers: a few at 15-17

graph summary (volatile.acidity)

  • Shape: approx. uniformly distributed, with fewer observations on the right side
  • Center: around 0.3-0.7
  • Spread: 0.2-1.2
  • Outliers: a few at 1.6

graph summary (citric.acid)

  • Shape: quite irregular; however, the counts start to decrease from 0.5 onward
  • Center: hard to say; maybe around 0.3
  • Spread: 0-0.75
  • Outliers: at 1

graph summary (residual.sugar)

  • Shape: bell-shaped; right-skewed
  • Center: approx. 2 (median 2.2 in the summary above)
  • Spread: 0-6
  • Outliers: at 8,11,13-15,16

graph summary (chlorides)

  • Shape: similar to the previous one; bell-shaped; right-skewed
  • Center: approx. 0.1
  • Spread: 0-0.2
  • Outliers: at 0.2,0.3-0.4,0.6

graph summary (free.sulfur.dioxide)

  • Shape: the number of observations decreases as the value increases; right-skewed
  • Center: hard to say; approx. 15
  • Spread: 0-50
  • Outliers: at 65-70

graph summary (total.sulfur.dioxide)

  • Shape: the number of observations decreases as the value increases; right-skewed
  • Center: hard to say; approx. 50
  • Spread: 0-150
  • Outliers: at 300

graph summary (density)

  • Shape: approx. normally distributed; skewed slightly left
  • Center: 0.997
  • Spread: 0.990-1.005
  • Outliers: at 0.990

graph summary (pH)

  • Shape: normally distributed; pretty symmetric
  • Center: 3.3
  • Spread: 2.8-3.7
  • Outliers: at 4

graph summary (sulphates)

  • Shape: approx. normally distributed; skewed right
  • Center: 0.7
  • Spread: 0.4-1.2
  • Outliers: at 1.6,2

graph summary (alcohol)

  • Shape: the number of observations decreases as the value increases; right-skewed
  • Center: hard to say; around 10 (median 10.2, mean 10.42 in the summary above)
  • Spread: 9-14
  • Outliers: at 15

summary

  • most of the variables are approximately normally distributed (bell-shaped)
  • sulphates, total sulfur dioxide, free sulfur dioxide, chlorides, and residual sugar are all skewed right
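
The skew readings above were made by eye. One crude numeric cross-check is to compare the mean and the median from the summary table: a mean clearly above the median suggests a right skew. The Python sketch below is only a heuristic, not a formal skewness statistic; the function name `skew_direction` and the 1% tolerance are assumptions made for illustration.

```python
def skew_direction(mean, median, rel_tol=0.01):
    """Guess the skew direction from mean vs. median.

    Heuristic only: rel_tol (assumed 1%) decides when the two values
    are close enough to call the distribution roughly symmetric.
    """
    scale = max(abs(mean), abs(median))
    if scale == 0 or abs(mean - median) <= rel_tol * scale:
        return "roughly symmetric"
    return "right-skewed" if mean > median else "left-skewed"


# (mean, median) pairs copied from the summary table above.
SUMMARY = {
    "residual.sugar": (2.539, 2.200),
    "chlorides": (0.08747, 0.07900),
    "free.sulfur.dioxide": (15.87, 14.00),
    "total.sulfur.dioxide": (46.47, 38.00),
    "sulphates": (0.6581, 0.6200),
    "density": (0.9967, 0.9968),
    "pH": (3.311, 3.310),
}

for name, (mean_value, median_value) in SUMMARY.items():
    print(name, "->", skew_direction(mean_value, median_value))
```

With these table values, the five variables listed as right-skewed come out right-skewed, while density and pH come out roughly symmetric.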

What is the structure of your dataset?

1599 observations of 12 variables (plus the row index X): 11 independent variables and 1 dependent variable (quality).

What is/are the main feature(s) of interest in your dataset?

“quality” is the main feature of interest. The other variables serve as predictors for it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Four main categories:

  • basic characteristics: pH, density, alcohol
  • acidity: fixed.acidity, volatile.acidity, citric.acid
  • sulfur dioxide: free.sulfur.dioxide, total.sulfur.dioxide
  • other flavor: residual.sugar, chlorides

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

No.

Bivariate Plots Section

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

explanation

Since the dependent variable is “quality”, I plotted every other variable against quality. However, we have to keep in mind that “quality” itself is roughly normally distributed: mediocre wines are the most numerous.

For the bar chart part:

  1. “residual.sugar” has an interesting distribution. Wines of extremely high (8,9) or low (2,3) quality are low in this variable, while mediocre wines have high values.
  2. “chlorides” and “sulphates” behave much like “residual.sugar”.
  3. “free.sulfur.dioxide” also behaves similarly. However, the closely related “total.sulfur.dioxide” behaves a bit differently: it increases with wine quality, yet the best wines (8,9) have low “total.sulfur.dioxide”.
  4. “density” and “pH” are largely unrelated to quality.
  5. “fixed.acidity” is only weakly linked to quality; however, bad wines do have lower fixed.acidity.
  6. “volatile.acidity” generally gets lower as quality increases.
  7. “alcohol” goes up as “quality” goes up.

For the boxplot + jitter graph:

  • quality “5” and “6” have lots of outliers in most variables
  • points 5 and 6 above (regarding acidity) are confirmed
  • citric.acid gets higher as quality gets better
  • residual sugar stays much the same across all quality levels; the claim in point 1 might be an effect of some outliers among the higher-quality wines. The same goes for chlorides
  • “free.sulfur.dioxide” has the “bell-shaped” distribution described in point 3, while “total.sulfur.dioxide” is actually more “bell-shaped” than described there
  • point 4 (density, pH) is confirmed
  • sulphates actually get higher as quality increases
  • alcohol does go up as quality goes up; however, low-quality wine has relatively high alcohol
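
Boxplots conventionally flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. Assuming the plots here follow that convention, the rule can be sketched in Python as below. The function names are hypothetical, and the quartiles used in the example are the whole-dataset values for residual.sugar from the summary table, not the per-quality quartiles that the boxplots actually show.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a point counts as an outlier."""
    iqr = q3 - q1  # interquartile range, i.e. the height of the box
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True when value falls outside the fences for the given quartiles."""
    lower, upper = tukey_fences(q1, q3, k)
    return value < lower or value > upper


# residual.sugar over the whole dataset: 1st Qu. 1.9, 3rd Qu. 2.6, Max. 15.5
lower, upper = tukey_fences(1.9, 2.6)
print(f"fences: {lower:.2f} to {upper:.2f}")
print("max is an outlier:", is_outlier(15.5, 1.9, 2.6))
```

Under this rule the fences for residual.sugar sit at roughly 0.85 and 3.65, so the maximum of 15.5 lies far outside them, which is consistent with a few extreme wines stretching the picture for this variable.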

individual plots

graph summary

  • relationship: sulphates get higher as the quality increases
  • distribution & outlier: most of the observations lie within the box range, but for quality 5 and 6 there are a lot of outliers

graph summary

  • relationship: total.sulfur.dioxide gets higher as the quality increases, but the best wines have low total.sulfur.dioxide
  • distribution & outlier: for wine quality 5 and 6, the values are concentrated at the lower end (smaller values); also, for quality 6, there are lots of outliers

graph summary

  • relationship: bell-shaped
  • distribution & outlier: quite even; however, for quality 5 and 6 the spread is quite wide

graph summary

  • relationship: bell-shaped

graph summary

  • relationship: bell-shaped
  • distribution & outlier: spread is narrow, which means the median is a good representation of the whole dataset; however, quality 5 and 6 have quite a few outliers

graph summary

  • relationship: inverted-bell-shaped
  • distribution & outlier: spread is quite wide

graph summary

  • relationship: citric.acid gets higher as the quality increases
  • distribution & outlier: spread is quite wide; however, for quality 5 and 6, lots of observations sit at the bottom (0)

summary

  • “fixed.acidity”: weak linkage with quality; however, bad wines do have lower fixed.acidity
  • “volatile.acidity”: gets lower when quality increases (strong)
  • “citric.acid”: gets higher as quality gets better
  • “residual.sugar”: (relatively) unrelated
  • “chlorides”: (relatively) unrelated
  • “free.sulfur.dioxide”: bell-shaped
  • “total.sulfur.dioxide”: bell-shaped
  • “density”: (relatively) unrelated
  • “pH”: (relatively) unrelated
  • “sulphates”: gets higher as quality increases
  • “alcohol”: goes up as quality goes up; however, low-quality wine has relatively high alcohol
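
The trends in this summary come from comparing the median line of each box across quality levels. A minimal Python sketch of that comparison follows. The function name `median_by_group` is hypothetical, and the rows are made-up toy values used only to show the mechanics, not observations from the wine dataset.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Return {group: median of value_key} with groups in sorted order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: median(values) for group, values in sorted(groups.items())}


# Toy rows (NOT real data): quality score and alcohol level.
toy_rows = [
    {"quality": 5, "alcohol": 9.0},
    {"quality": 5, "alcohol": 10.0},
    {"quality": 6, "alcohol": 10.5},
    {"quality": 7, "alcohol": 11.0},
    {"quality": 7, "alcohol": 11.5},
    {"quality": 7, "alcohol": 12.0},
]

print(median_by_group(toy_rows, "quality", "alcohol"))
```

A median that rises steadily from one quality level to the next would correspond to "gets higher as quality increases"; a median that peaks in the middle would correspond to "bell-shaped".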

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

(using some of the graphs in the next section)

- positive linear:

  • “fixed.acidity”, “citric.acid” -> strong
  • “fixed.acidity”, “density” -> strong

- negative linear:

  • “fixed.acidity”, “pH” -> strong
  • “citric.acid”, “pH”
  • “density”, “alcohol”

- other

  • “citric.acid”, “volatile.acidity”
  • “free.sulfur.dioxide”,“total.sulfur.dioxide” -> strong

What was the strongest relationship you found?

with quality:

“volatile.acidity”

with other vars:

  • “fixed.acidity”, “citric.acid”
  • “fixed.acidity”, “density”
  • “fixed.acidity”, “pH”
  • “free.sulfur.dioxide”,“total.sulfur.dioxide”
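
The "strong" labels above were assigned by looking at the scatter plots. For the linear pairs, a Pearson correlation coefficient would put a number on the strength. The following is a minimal Python sketch using only the standard library; the function name `pearson_r` is hypothetical and the two short lists are made-up toy values, not wine measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy values (NOT real data): a rising and a falling relationship.
acid = [1.0, 2.0, 3.0, 4.0]
rising = [2.0, 4.0, 6.0, 8.0]
falling = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(acid, rising), pearson_r(acid, falling))
```

Values near +1 or -1 would support a "strong" positive or negative linear label; values near 0 would not rule out a non-linear relationship such as the ones listed under "other".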

Multivariate Plots Section

Multivariate Analysis

note: the numbers are the mutual information of the respective variable pairs

## [1] "the mutual value of"
## [1] "fixed.acidity"
## [1] "citric.acid"
## [1] 0.3263575

## [1] "the mutual value of"
## [1] "fixed.acidity"
## [1] "density"
## [1] 0.3268869

## [1] "the mutual value of"
## [1] "fixed.acidity"
## [1] "pH"
## [1] 0.3384619

## [1] "the mutual value of"
## [1] "citric.acid"
## [1] "fixed.acidity"
## [1] 0.3263575

## [1] "the mutual value of"
## [1] "free.sulfur.dioxide"
## [1] "total.sulfur.dioxide"
## [1] 0.355175

## [1] "the mutual value of"
## [1] "total.sulfur.dioxide"
## [1] "free.sulfur.dioxide"
## [1] 0.355175

## [1] "the mutual value of"
## [1] "density"
## [1] "fixed.acidity"
## [1] 0.3268869

## [1] "the mutual value of"
## [1] "pH"
## [1] "fixed.acidity"
## [1] 0.3384619
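
Mutual information is symmetric, which is why each pair above shows the same value in both orders. The Python sketch below illustrates one common way to estimate it for continuous variables: cut each variable into equal-width bins and compute the mutual information of the binned values, in nats. The binning scheme, the number of bins and the function names are all assumptions; the estimator behind the numbers above is not stated, so this sketch is not expected to reproduce them.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys, n_bins=5):
    """Estimate mutual information (in nats) from binned values."""
    bx = equal_width_bins(xs, n_bins)
    by = equal_width_bins(ys, n_bins)
    n = len(bx)
    joint = Counter(zip(bx, by))
    count_x = Counter(bx)
    count_y = Counter(by)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with counts
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


# Toy values (NOT real data).
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 6]
print(mutual_information(a, b), mutual_information(b, a))
```

The two printed values agree, mirroring the repeated values in the output above. Independent variables give a value near 0, and larger values indicate stronger dependence of any form, not only linear.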

quick summary

I found the relationships among “fixed.acidity”, “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol” quite interesting, so I decided to look further into these plots.

graph summary

  • the three lines do not overlap, which means wines of different quality do have different characteristics on these two variables
  • however, the lines are rather flat, which means quality is primarily related to the sulphates

graph summary

  • the three lines do not overlap, which means wines of different quality do have different characteristics on these two variables
  • the worst-quality wine has the most volatile.acidity; however, its relationship is a bit more complicated and is represented by the dotted line in the graph

graph summary

  • the best-quality wines have either low alcohol and high fixed.acidity, or high alcohol and low fixed.acidity
  • for medium- and low-quality wine, the pattern is rather random

graph summary

  • the three lines do not overlap except when sulphates is larger than 1.5
  • the best-quality wine has the lowest volatile.acidity

graph summary

  • the best wine has high alcohol; for the others it is more random

graph summary

  • the worst wine has the highest volatile.acidity

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I did two kinds of plotting:
  • a scatter plot of every pair of variables, with each quality level in a different color
  • scatter plots of selected variables, with a smooth line added for each quality level, each drawn in a different line type
Result:

Most of the variables don’t really strengthen each other when combined. We see a straight line in the graph, indicating that only one of the variables (instead of both) is influencing the result.

Were there any interesting or surprising interactions between features?

I found the interaction between alcohol and fixed.acidity particularly interesting. As fixed.acidity and alcohol both get higher, the quality of the wine increases.


Final Plots and Summary

Plot One

Description One

  • The median (as well as the “box”) of citric.acid gets larger as quality gets better, indicating a positive correlation between citric.acid and quality.
  • However, the full range (the “line”) keeps the same length and position, indicating that some outlier observations are influencing the result.
  • For quality 5 and 6, the distributions of the “dots” are quite similar, indicating that medium-quality wines are all quite similar.

Plot Two

Description Two

  • We can see that the “box” and the median of volatile.acidity decrease as quality increases.
  • Quality 7 and 8 have similar volatile.acidity.
  • However, for quality 5 and 6 there are many outliers, which might cause problems: this group of data might not be properly represented in this graph.

Plot Three

Description Three

  • The high-quality wine’s line lies mostly further to the right than the lines for low- and medium-quality wine, indicating that high-quality wine has an overall higher (alcohol + fixed.acidity).
  • However, most wines are in the “low alcohol, low fixed.acidity” category and are colored green to indicate lower quality.

Reflection

When dealing with this dataset, I found two major problems particularly challenging:

  1. How to choose the right kind of graph to represent the data. There are tons of kinds of graphs out there; some of them can be quite hard to read, while others might not represent the data accurately. Furthermore, even if a kind of graph is useful for evaluating certain kinds of data, I might not be familiar enough with it to use it well.
  2. How to determine whether a particular relationship is “interesting” enough. While some of the relationships in these graphs are strong enough to be identified, others are so vague that I don’t know whether to call them “interesting” or not.

Limits when dealing with this dataset:
  • I judged the plots by eye (with some help from the “smooth” line function, which draws a regression-style line on the graph), so the results might be too subjective.
  • In this dataset, it is quite hard to find complicated relationships, since all the variables are rather simple and have no extra dimension (e.g. time).
  • Since I don’t really know wine, I don’t really know how to interpret the results.
Some suggestions:
  • I think using machine learning to recognize relationships would be a more efficient and more objective approach. Since our dependent variable is an ordinal variable, we can either 1) convert it to a binary variable (e.g. high quality or not) and use classifiers such as a decision tree, or 2) use regression algorithms to predict the result.
  • The relationship between “alcohol” and “fixed.acidity” is quite interesting and worth further investigation.
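
As a starting point for suggestion 1), the conversion of the ordinal quality score into a binary label could look like the Python sketch below. The function names are hypothetical and the cut-off of 7 for "high quality" is an assumption chosen only for illustration; a different threshold may suit the data better. The scores in the example are toy values, not rows from the dataset.

```python
def binarize_quality(scores, threshold=7):
    """Label each score 1 (high quality) if it is >= threshold, else 0.

    Assumption: threshold=7 is an illustrative cut-off, not a value
    taken from the analysis above.
    """
    return [1 if score >= threshold else 0 for score in scores]


def positive_share(labels):
    """Fraction of labels equal to 1, useful for checking class balance."""
    return sum(labels) / len(labels)


# Toy scores (NOT real data) covering the observed range 3 to 8.
toy_scores = [3, 5, 5, 6, 6, 6, 7, 8]
labels = binarize_quality(toy_scores)
print(labels, positive_share(labels))
```

Because mediocre wines are the most numerous, the "high quality" class is likely to be the smaller one, so checking the class balance before training a classifier seems worthwhile.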